EDA - Final Project by Xu Ren

Univariate Plots Section

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

Univariate Analysis

What is the structure of your dataset?

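This dataset was published by Cortez et al. in 2009. It contains 1,599 observations of 12 variables: 11 numeric physicochemical measurements and an integer quality score that ranges from 3 to 8. The dataset is clean, and the only variable that contains values of zero is “citric.acid”, which is plausible and can be trusted.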
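The zero check described above can be sketched in a few lines of base R. The data frame name `wine_head` is hypothetical, and it holds only the first ten rows of three columns copied from the `str()` output, so the counts it prints describe that small sample and not the full dataset.

```r
# First ten rows of three columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  citric.acid    = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
                     0.065, 0.073, 0.071)
)

# Count exact zeros and missing values in each column.
zero_counts <- colSums(wine_head == 0)
na_counts   <- colSums(is.na(wine_head))

print(zero_counts)
print(na_counts)
```

Running the same two `colSums()` calls on the full data frame would show whether any other column contains zeros or missing values.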

What is/are the main feature(s) of interest in your dataset?

Given my limited knowledge of red wine, my experience tells me that alcohol content, acidity (pH), and the amount of sulphates may have a real impact on the quality of a particular wine. Additional features that may be of interest are chlorides, citric.acid, and residual.sugar. We will also explore the relationships of volatile.acidity, fixed.acidity, and the sulfur dioxide measures (free.sulfur.dioxide and total.sulfur.dioxide). Later in our data exploration, we may also create bins for a more qualitative analysis.

Did you create any new variables from existing variables in the dataset?

I created a binary field for “good” and “bad” wines. Here I am briefly stepping outside of my role as a data explorer and into my role as a potential consumer of red wines. As a consumer, I am less interested in predicting the exact value of the quality variable (i.e. whether a wine is a 3 or a 7) than in knowing how likely a wine is to be good. I therefore binned the quality variable into a binary variable (quality greater than 5 is “good”; 5 or below is “bad”) so that I can later fit a logistic regression or a decision tree for classification. This lets me, the consumer, narrow a list of wines I am considering for my next party down to those most likely to be “good.” To me, this is much more useful!
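A minimal sketch of the binning step, assuming the threshold stated in the Reflection (quality greater than 5 is “good”). The `quality` vector holds only the first ten values shown in the `str()` output. The name `good.bad` matches the model formulas printed later, but the exact code used in the project is not shown, so treat this as one plausible way to build the variable.

```r
# First ten quality scores from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# 1 = "good" (quality > 5), 0 = "bad" (quality <= 5).
good.bad <- as.integer(quality > 5)

# Tabulate how many wines fall into each class.
counts <- table(good.bad)
print(counts)
```

Applied to the full `quality` column, the same comparison should reproduce the class counts reported in the model output.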

Bivariate Plots Section

## Warning in loop_apply(n, do.ply): Removed 117 rows containing non-finite
## values (stat_boxplot).

## Warning in loop_apply(n, do.ply): Removed 125 rows containing non-finite
## values (stat_boxplot).

## Warning in loop_apply(n, do.ply): Removed 125 rows containing non-finite
## values (stat_boxplot).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I was very surprised that pH did not show a strong distinction between the “good” and “bad” wines. Looking at the quality variable, higher-rated wines did appear to be slightly more acidic, but this relationship was not nearly as strong as I had assumed. I was also surprised that residual.sugar did not have a strong effect on “good” versus “bad” wines either.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It was very informative to notice that the boxplots for citric.acid and volatile.acidity showed a relatively strong distinction between the “good” and “bad” wines. Based on my prior (limited) knowledge of wines, I would not have known the importance of these features. I did, however, know about the importance of alcohol content, most likely because everyday language often describes wines and spirits by their “percent of alcohol.” I noticed that the “good” wines tend to have higher alcohol content, for that stronger, fuller-bodied taste.

What was the strongest relationship you found?

I am most interested in the boxplot comparing alcohol between “good” and “bad” wines. The median alcohol value among the “good” wines was considerably higher than that of the “bad” wines. However, the alcohol content of the “bad” wines was also more dispersed, with a wider range between the minimum and maximum values.
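The numbers behind such a boxplot comparison can be computed with `tapply()`. The sketch below uses only the first ten rows of `alcohol` and `quality` from the `str()` output, and it assumes the “good” label means quality greater than 5. Ten rows are far too few to be representative, so the code illustrates the calculation and does not reproduce the result.

```r
# First ten rows from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed labeling rule: quality > 5 is "good", otherwise "bad".
label <- ifelse(quality > 5, "good", "bad")

# Centre and spread of alcohol within each group.
group_median <- tapply(alcohol, label, median)
group_range  <- tapply(alcohol, label, function(x) diff(range(x)))
group_iqr    <- tapply(alcohol, label, IQR)

print(group_median)
print(group_range)
print(group_iqr)
```

The median answers "which group tends to have more alcohol", while the range and interquartile range answer "which group is more dispersed".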

Multivariate Plots Section

## 
## Calls:
## logm1: glm(formula = good.bad ~ alcohol, family = binomial, data = wine_analysis)
## logm2: glm(formula = good.bad ~ alcohol + volatile.acidity, family = binomial, 
##     data = wine_analysis)
## logm3: glm(formula = good.bad ~ alcohol + volatile.acidity + sulphates, 
##     family = binomial, data = wine_analysis)
## logm4: glm(formula = good.bad ~ alcohol + volatile.acidity + sulphates + 
##     chlorides, family = binomial, data = wine_analysis)
## logm5: glm(formula = good.bad ~ alcohol + volatile.acidity + sulphates + 
##     pH, family = binomial, data = wine_analysis)
## 
## ============================================================================
##                         logm1      logm2      logm3      logm4      logm5   
## ----------------------------------------------------------------------------
## (Intercept)           -10.763*** -8.344***  -9.720***  -9.341***  -9.310*** 
##                        (0.683)   (0.726)    (0.784)    (0.794)    (1.429)   
## alcohol                 1.056***  1.005***   0.997***   0.950***   1.003*** 
##                        (0.067)   (0.069)    (0.069)    (0.070)    (0.071)   
## volatile.acidity                 -3.541***  -3.122***  -2.981***  -3.090*** 
##                                  (0.355)    (0.365)    (0.368)    (0.376)   
## sulphates                                    1.873***   2.517***   1.852*** 
##                                             (0.369)    (0.437)    (0.375)   
## chlorides                                              -4.297**             
##                                                        (1.434)              
## pH                                                                -0.143    
##                                                                   (0.418)   
## ----------------------------------------------------------------------------
## Aldrich-Nelson R-sq.      0.177      0.222      0.233      0.236      0.233 
## McFadden R-sq.            0.156      0.207      0.219      0.224      0.219 
## Cox-Snell R-sq.           0.194      0.248      0.261      0.266      0.261 
## Nagelkerke R-sq.          0.258      0.332      0.349      0.355      0.349 
## phi                       1.000      1.000      1.000      1.000      1.000 
## Likelihood-ratio        343.939    456.470    484.482    493.959    484.599 
## p                         0.000      0.000      0.000      0.000      0.000 
## Log-likelihood         -932.517   -876.251   -862.245   -857.507   -862.187 
## Deviance               1865.034   1752.503   1724.491   1715.014   1724.374 
## AIC                    1869.034   1758.503   1732.491   1725.014   1734.374 
## BIC                    1879.788   1774.634   1753.999   1751.900   1761.260 
## N                      1599       1599       1599       1599       1599     
## ============================================================================
## 
##   0   1 
## 744 855
## predictions
##   0   1 
## 788 811
##            
## predictions   0   1
##           0 550 238
##           1 194 617
## [1] 0.7298311
##                 
## tree.predictions   0   1
##                0 598 253
##                1 146 602
## [1] 0.750469

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Again, alcohol content, even when considered alongside other feature variables, was the strongest explanatory variable for quality. I was most interested in the plots of alcohol in conjunction with sulphates, volatile.acidity, density, and chlorides, as these features appeared to further sharpen the decision boundary between “good” and “bad” wines.
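One way to read the alcohol coefficient in the model table is as an odds ratio. The sketch below takes the `logm1` estimate and standard error from the table (1.056 and 0.067) and exponentiates them. The 95% interval is a Wald approximation, which is an assumption made here and not something reported in the output.

```r
# Estimate and standard error for alcohol in logm1, copied from the table.
coef_alcohol <- 1.056
se_alcohol   <- 0.067

# Multiplicative change in the odds of "good" per one-unit increase
# in alcohol (one percentage point of alcohol content).
odds_ratio <- exp(coef_alcohol)

# Approximate 95% Wald interval on the odds-ratio scale (an assumption).
wald_ci <- exp(coef_alcohol + c(-1, 1) * qnorm(0.975) * se_alcohol)

print(round(odds_ratio, 3))
print(round(wald_ci, 3))
```

Under `logm1`, each additional unit of alcohol multiplies the odds of a wine being “good” by roughly 2.9. The estimate stays close to 1.0 on the log-odds scale in all five models.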

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a few logistic regression models as well as a decision tree model. First, it must be stated that this is not a true machine learning exercise: no attention was given to tuning, cross-validation, or pruning/penalization. Both models were fit and evaluated on all 1,599 observations, so the reported accuracies (about 0.73 for the logistic regression and 0.75 for the decision tree) are in-sample and likely optimistic. The logistic regression model was useful for seeing which feature variables really affected the binary “good” and “bad” label. This is useful to me as a consumer of wine, especially as I am often limited in the information I can gather in a very short period of time at the store. However, as a human, I cannot easily evaluate a logistic regression model at the store, so I decided to also use a decision tree model. The decision tree is much more helpful for my next trip to the wine store; moreover, it captures interactions between the features that the logistic regression did not.
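The two accuracy values printed in the model output can be recomputed directly from the confusion tables. The sketch below re-enters both tables by hand and assumes that rows are predictions and columns are actual labels, which is consistent with the printed row and column totals. The helper names `accuracy`, `sensitivity`, and `specificity` are illustrative.

```r
# Confusion tables copied from the output above.
# Assumption: rows = predicted class, columns = actual class.
logit_cm <- matrix(c(550, 194, 238, 617), nrow = 2,
                   dimnames = list(predicted = c("0", "1"),
                                   actual    = c("0", "1")))
tree_cm  <- matrix(c(598, 146, 253, 602), nrow = 2,
                   dimnames = list(predicted = c("0", "1"),
                                   actual    = c("0", "1")))

# Share of all wines classified correctly.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Share of actual "good" wines predicted as good.
sensitivity <- function(cm) cm["1", "1"] / sum(cm[, "1"])

# Share of actual "bad" wines predicted as bad.
specificity <- function(cm) cm["0", "0"] / sum(cm[, "0"])

print(c(logit = accuracy(logit_cm), tree = accuracy(tree_cm)))
print(c(logit = sensitivity(logit_cm), tree = sensitivity(tree_cm)))
print(c(logit = specificity(logit_cm), tree = specificity(tree_cm)))
```

Looking at sensitivity and specificity separately shows where each model makes its mistakes, which matters for a shopper who would rather skip a good wine than buy a bad one.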


Final Plots and Summary

Plot One

Description One

This boxplot places “bad” and “good” wines side by side and shows their respective summary values for alcohol content. “Good” wines appear to have higher alcohol content, though the alcohol content of “bad” wines is more dispersed.

Plot Two

Description Two

I found the interaction between alcohol content and sulphates to be the most interesting in distinguishing “bad” from “good” wines. The scatter plot and the density plot both highlight this differentiation.

Plot Three

Description Three

Though not strictly an exploratory plot, this decision tree shows the importance of the various features and serves as a guide for purchasing wine. Furthermore, it highlights the interaction between alcohol, sulphates, and volatile acidity.


Reflection

The Wine Quality dataset is a tidy dataset of 1,599 observations, made public for research by P. Cortez et al. I intentionally did not read the paper “Using Data Mining for Wine Quality Assessment” until after finishing the project, so as to avoid bias and to grapple with the dataset on my own.

I started by examining the distribution of wine quality in the dataset. I then created a binary variable labeling wines “good” or “bad” according to whether quality was greater than 5 or less than or equal to 5. Next, I explored individual features, and combinations of features, in relation to that label. I was most surprised that chlorides, pH, residual sugar, and citric acid did not have a stronger relationship with the quality label. Finally, I fit a few logistic regression models and a decision tree model using all observations in the dataset.

These models have several limitations. First, I simplified the quality label into only two classes, although as a consumer of wines I felt this was sufficient for my next trip to the store. Second, the quality scores are based on sensory judging, which may vary significantly from individual to individual. Nonetheless, the decision tree may provide a simple guide for deciding which wine to try next. It would also be helpful to have datasets from different time periods or different judges to see how accurate the models remain on new data.